fix/review-issue-2 #4
Conversation
…d, mention memory limit in readme, readme mention website content crawler, crawler config only non defaults, added logging and minor code improvements
I'm sorry, a couple more comments. As previously, I only did a quick pass.
main.py
- In the comments, avoid general references to Apify Actors. Be specific to this actor. For example: "Main entry point for this Apify Actor."
- Do not raise `RuntimeError`; use `Apify.fail` instead.
- Ensure you handle the case when the dataset is empty.
helpers.py
- I'm not a fan of one-line functions, as they make the code harder to read. For example:

  ```python
  def render_llms_txt(data: dict) -> str:
      """Renders the `llms.txt` file using the provided data."""
      return render(data)
  ```
renderer.py
- Rather than concatenating strings directly, add items to an array and then use `join` to concatenate. It is more efficient.
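The suggestion above can be sketched as follows; the `render` signature and section strings are illustrative, not the Actor's actual code:

```python
def render(sections: list[str]) -> str:
    # Collect parts in a list and join once at the end: join is a single
    # O(n) pass, while repeated `+=` on a str may re-copy the accumulated
    # text on every iteration.
    parts = []
    for section in sections:
        parts.append(section.rstrip())
        parts.append('')  # blank line between sections
    return '\n'.join(parts)
```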
/tests
- I generally prefer `pytest` over `unittest`. `pytest` has a cleaner syntax, supports fixtures, and, for me, it is easier to work with. Also, it is used in `python-crawlee`, which we (at least myself) consider the Apify standard.
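As a hedged illustration of the difference between the two styles (the helper shown is a stand-in, not the repository's actual test):

```python
import unittest

def render_llms_txt(data: dict) -> str:
    # Hypothetical helper under test.
    return f"# {data['title']}\n"

# unittest style: class boilerplate and camelCase assert methods.
class TestRenderLlmsTxt(unittest.TestCase):
    def test_title(self):
        self.assertEqual(render_llms_txt({'title': 'Docs'}), '# Docs\n')

# pytest style: a bare function with a plain assert; fixtures and
# parametrization come for free when run under pytest.
def test_title():
    assert render_llms_txt({'title': 'Docs'}) == '# Docs\n'
```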
I agree with the points made by @jirispilka, and I added some nits of my own. But overall, it looks solid!
```diff
 "type": "string",
-"description": "The URL of website you want to get the llm.txt generated for.",
+"description": "The URL from which the crawler will start to generate the llms.txt file.",
 "editor": "textfield",
```
It's better to use the `requestListSources` editor in this case - see https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1#array. You may use `apify.RequestList` to process it afterwards.
Unless you really want to support just one URL every time, of course.
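For reference, a `requestListSources` field in the input schema might look like this; the field name and prefill value are illustrative, following the linked specification:

```json
{
  "startUrls": {
    "title": "Start URLs",
    "type": "array",
    "description": "URLs where the crawler should start.",
    "editor": "requestListSources",
    "prefill": [{ "url": "https://docs.apify.com/" }]
  }
}
```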
Because of how the format of `llms.txt` is specified, I think the single URL input is more suitable - we treat it like an index for the whole site (or sub-site), so the URL should act as a root (or sub-root).
```diff
@@ -13,6 +13,8 @@
 if TYPE_CHECKING:
     from apify_client.clients import KeyValueStoreClientAsync

+logger = logging.getLogger('apify')
```
You can just use `Actor.log` instead.
Using `Actor.log` in `main.py` now, but in `helpers.py` pytest prints warnings because of a non-existent event loop, so I will keep the logger there.
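A minimal sketch of the compromise described above, assuming the SDK's logger name `'apify'`; the helper function is hypothetical:

```python
import logging

# Plain stdlib logger: safe to import from helpers during pytest runs,
# where no Actor event loop exists.
logger = logging.getLogger('apify')

def describe_page(title: str) -> str:
    # Hypothetical helper that logs without touching Actor.log.
    logger.debug('rendering entry for %s', title)
    return f'- {title}'
```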
…s, render using join, switch to pytest, actor.log in main
README.md (outdated)
```diff
-The **llms.txt Generator Actor** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA.
+The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
```
```diff
-The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
+The **llms.txt generator** is an Apify Actor that helps you extract essential website content and generate an [llms.txt](https://llmstxt.org/) file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This Actor leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
```
capitalized 'Actor' and added llms.txt org link to top description
One nit: tool -> Actor
changed tool -> Actor in README
Thanks for the changes!
PR for #2

Changes

Code Changes
- Renamed `url` to `startURL` for consistency with Website Content Crawler.
- Removed the fallback in `actor_input = await Actor.get_input() or {'url': 'https://docs.apify.com/'}` to ensure the Actor fails if no value is provided.
- Refactored `record = await kvstore.get_record(store_id)` with `if record is None` to improve code readability.
- Removed confusing comments such as `"# changed by get_crawler_actor_config with defailt value 1"`.
- Updated `CRAWLER_CONFIG` to only set values that differ from the default.
- Documented the `render(data: dict) -> str` function to help users understand its usage.
- Added `store = await Actor.open_key_value_store()` and `await store.set_value('llms.txt', output)` to enhance UX.
- Added `# TODO: use path or LLM suggestions to group pages into sections # noqa: TD003` and recommended creating an issue for future improvements.

README Changes
- Updated `A simple, AI-focused structure to help coders, researchers, and AI models easily access and use website content.` to include emphasis for consistency.
- Updated headings such as `Features of llms.txt Generator` or `Content Extraction` to use proper capitalization.

Open Points
- Why `get_description_from_html` is used: sometimes the Website Content Crawler does not return meta descriptions in the dataset, even when available in the HTML. So, I extract the descriptions myself for now, but I can try to fix this issue in the crawler.
- Replaced the proposed example of `llms.txt` with a real one: added a real output example of `llms.txt` generated by the Actor for `docs.apify.com` and kept the proposed structure example on top.